[GPU] Optimize merge memory usage #136411
Conversation
Pinging @elastic/es-search-relevance (Team:Search Relevance)
libs/simdvec/src/main/java/org/elasticsearch/simdvec/QuantizedByteVectorValuesAccess.java
distribution/tools/server-cli/src/main/java/org/elasticsearch/server/cli/SystemJvmOptions.java
...rc/main/java/org/elasticsearch/index/codec/vectors/reflect/VectorsFormatReflectionUtils.java
@ldematte Great work. I have not tested it yet, but it's amazing how you organized it. My main comment: do you think we can simplify this PR by breaking it into two separate ones, making this PR only about changes to merges, and doing the changes for flush, ResourcesHolder, and 128Mb in a separate PR? Or are these changes tightly coupled?
...rc/main/java/org/elasticsearch/index/codec/vectors/reflect/VectorsFormatReflectionUtils.java
...rc/main/java/org/elasticsearch/index/codec/vectors/reflect/VectorsFormatReflectionUtils.java
x-pack/plugin/gpu/src/main/java/org/elasticsearch/xpack/gpu/codec/ES92GpuHnswVectorsWriter.java
I can do that: here is the PR #136464
x-pack/plugin/gpu/src/main/java/org/elasticsearch/xpack/gpu/codec/ES92GpuHnswVectorsWriter.java
@ldematte Great changes. I have done some benchmarking on my laptop with int8, and I see great recall but, surprisingly, no speedups compared with the main branch. Datasets used:

gist: 1_000_000 docs; 960 dims; euclidean metric
cohere-wikipedia_v2: 934_024 docs; 768 dims; cosine metric
x-pack/plugin/gpu/src/main/java/org/elasticsearch/xpack/gpu/codec/ES92GpuHnswVectorsWriter.java
Great work, @ldematte
@mayya-sharipova I also expected speed-ups on force merge; it seems to be a bit better, but the improvement is a few percent ("%"), not a multiple ("x").
@mayya-sharipova I updated merge as agreed, to avoid using device memory directly, due to the cuVS bug.
@ldematte Thanks, the latest changes to copy to a separate memory segment LGTM.
I have benchmarked merge performance for this, with both KnnIndexTester and ES. TL;DR: merge performance improves by 18-20%, but that gets "lost" in high-level benchmarks (they give almost identical results, within the variance). BUT with this change, ES uses no additional disk space (which in the case of 1M vectors can be 5GB!), and in the case of float32 the memory footprint of the process (working set) is reduced too (int8 will also be fixed once cuVS is fixed).
This PR changes how we gather and compact vector data for transmitting it to the GPU. Instead of using a temporary file to write out the compacted arrays, we use the vector values from the scorer supplier directly, which are backed by a memory-mapped input. This way we avoid an additional copy of the data.
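The idea of reading vectors straight from a memory-mapped file instead of first staging them in a temporary copy can be sketched as below. This is a minimal, self-contained illustration, not the actual Elasticsearch/cuVS code: the file layout, class name, and vector contents are all made up for the example; the real implementation works through Lucene's scorer supplier and off-heap vector values.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: access float vectors as views over a memory-mapped
// file, avoiding the extra copy a temporary compacted file would require.
public class MappedVectorAccess {
    public static void main(String[] args) throws IOException {
        int dims = 4;
        int numVectors = 3;
        Path file = Files.createTempFile("vectors", ".bin");

        // Write a few example vectors (stand-in for an existing segment file).
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.allocate(numVectors * dims * Float.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
            for (int v = 0; v < numVectors; v++) {
                for (int d = 0; d < dims; d++) {
                    buf.putFloat(v * 10 + d);
                }
            }
            buf.flip();
            ch.write(buf);
        }

        // Map the file once; each vector read is a view over the mapping,
        // so no intermediate temp-file copy of the data is created.
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size())
                .order(ByteOrder.LITTLE_ENDIAN);
            float[] vec = new float[dims];
            int ord = 1; // read the second vector (values 10..13)
            mapped.position(ord * dims * Float.BYTES);
            mapped.asFloatBuffer().get(vec, 0, dims);
            System.out.println(vec[0] + "," + vec[3]);
        }
        Files.delete(file);
    }
}
```

With this pattern, the consumer (here the GPU upload path) can be handed contiguous slices of the mapping directly, which is where the disk-space and working-set savings described above come from.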
💚 Backport successful